Creating an African American-Sounding TTS: Guidelines, Technical Challenges, and Surprising Evaluations
Supplementary Audio Material. Submitted to IUI 2024
Table 1: Natural recordings from the selected AA voice. Samples marked with (*) were used as part of Study 3.
Samples (non-synthetic)
1
2 (*)
3 (*)
4 (*)
5
Table 2: Comparison of single- and multi-speaker models. The multi-speaker model was used to generate the final samples for Studies 1 and 2 (see Table 3).
# | Single-Speaker Model | Multi-Speaker Model | Text
1 | | | OK, call us back if you run into any other issues, and enjoy the rest of your afternoon! Bye!
2 | | | Yes, I'm back! Thank you so much for holding! I really appreciate your patience!
3 | | | I'm not really sure what's causing this delay. Looks like the item is in stock? Let me take a closer look.
4 | | | Yes, would you mind holding the line just a bit longer? Sorry to do this to you again, but I'm having some issues retrieving your account.
5 | | | Um, sorry, but I believe there are no direct flights out of your preferred airport.
Table 3: Synthetic samples used in Studies 1 and 2 for the AA and WH voices. To prevent listeners from being exposed to the same text more than once, a different set of sentences was used for each voice.